Initial NLP Work on Film Scripts

Author

Mick Cooney

Published

January 31, 2023

In this workbook we perform the initial NLP pre-processing and simple explorations and visualisations of the data.

1 Load Data

Show code
films_master_tbl <- read_rds("data/films_master_tbl.rds")

films_master_tbl |> glimpse()
Rows: 51
Columns: 9
$ film_title      <chr> "12 Years a Slave", "2001 A Space Odyssey", "A Few Goo…
$ release_year    <int> 2013, 1968, 1991, 1977, 1979, 1997, 1992, 1955, 1974, …
$ genre           <chr> "Drama", "Science Fiction", "Drama", "Comedy", "War", …
$ title_cleaned   <chr> "12_years_a_slave", "2001_a_space_odyssey", "a_few_goo…
$ cached_htmlfile <glue> "scraped_files/scraped_raw/scraped_raw_12_years_a_sla…
$ script_txtfile  <glue> "scraped_files/scraped_raw/script_12_years_a_slave.tx…
$ cleaned_txtfile <glue> "scraped_files/cleaned_script/cleanedscript_12_years_…
$ flag_cleaned    <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE,…
$ parsed_file     <glue> "data/parsed_scripts/parsedscript_12_years_a_slave.rd…

The parsed data is stored in the file listed in the column parsed_file and contains two separate tibbles: one with the detailed parsing of the script, and a second that aggregates all the text for both scene directions and dialogue.

1.1 Load Sentiments Data

We also want to load the sentiment lexicon data.

Show code
sentiments_afinn_tbl    <- read_rds("input_data/sentiments_afinn_tbl.rds")
sentiments_afinn_tbl    |> glimpse()
Rows: 2,477
Columns: 2
$ word  <chr> "abandon", "abandoned", "abandons", "abducted", "abduction", "ab…
$ value <dbl> -2, -2, -2, -2, -2, -2, -3, -3, -3, -3, 2, 2, 1, -1, -1, 2, 2, 2…
Show code
sentiments_bing_tbl     <- read_rds("input_data/sentiments_bing_tbl.rds")
sentiments_bing_tbl     |> glimpse()
Show code
sentiments_loughran_tbl <- read_rds("input_data/sentiments_loughran_tbl.rds")
sentiments_loughran_tbl |> glimpse()
Rows: 4,150
Columns: 2
$ word      <chr> "abandon", "abandoned", "abandoning", "abandonment", "abando…
$ sentiment <chr> "negative", "negative", "negative", "negative", "negative", …
Show code
sentiments_nrc_tbl      <- read_rds("input_data/sentiments_nrc_tbl.rds")
sentiments_nrc_tbl      |> glimpse()
Rows: 13,872
Columns: 2
$ word      <chr> "abacus", "abandon", "abandon", "abandon", "abandoned", "aba…
$ sentiment <chr> "trust", "fear", "negative", "sadness", "anger", "fear", "ne…

1.2 Pre-process Data

A number of film scripts have not parsed properly, so we create a list of those films and exclude them from the analysis.

We may go back later to improve the parsing, and if so we will update this list.

Show code
films_exclude_tbl <- c(
    "12_years_a_slave", "2001_a_space_odyssey", "django_unchained",
    "donnie_brasco", "drive", "gran_torino", "leaving_las_vegas",
    "lock_stock_and_two_smoking_barrels", "moneyball", "office_space",
    "star_wars_return_of_the_jedi", "the_green_mile"
    ) |>
  enframe(name = NULL, value = "title_cleaned")

films_exclude_tbl |> glimpse()
Rows: 12
Columns: 1
$ title_cleaned <chr> "12_years_a_slave", "2001_a_space_odyssey", "django_unch…

Having excluded these films from the parsing and future analysis, we load the parsed script data and read it into our table.

Show code
films_parsed_tbl <- films_master_tbl |>
  anti_join(films_exclude_tbl, by = "title_cleaned") |>
  mutate(
    parsed_data = map(parsed_file, read_rds)
    ) |>
  unnest(parsed_data) |>
  select(
    film_title, release_year, genre, title_cleaned, parsing_detailed,
    parsing_aggregated
    )

films_parsed_tbl |> glimpse()
Rows: 39
Columns: 6
$ film_title         <chr> "A Few Good Men", "Airplane", "Apocalypse Now", "Au…
$ release_year       <int> 1991, 1977, 1979, 1997, 1992, 1955, 1974, 1990, 198…
$ genre              <chr> "Drama", "Comedy", "War", "Comedy", "Drama", "Weste…
$ title_cleaned      <chr> "a_few_good_men", "airplane", "apocalypse_now", "au…
$ parsing_detailed   <list> [<tbl_df[8443 x 13]>], [<tbl_df[5646 x 13]>], [<tb…
$ parsing_aggregated <list> [<tbl_df[1794 x 5]>], [<tbl_df[1245 x 5]>], [<tbl_…

2 Initial NLP Processing

We now want to perform some very basic NLP processing such as tokenisation.

Once we have tokenised the script, we also remove “stop words” - that is, common words that do not convey meaning, such as “and”, “to”, “the” and so on.

Show code
data(stop_words)

films_tokens_tbl <- films_parsed_tbl |>
  mutate(
    wordtoken_data = map(
      parsing_aggregated, unnest_tokens,
      output = word, input = trimmed_text
      ),
    ngramtoken_data = map(
      parsing_aggregated, unnest_tokens,
      output = word, input = trimmed_text, token = "ngrams", n = 2, n_min = 1
      )
    ) |>
  select(-parsing_detailed, -parsing_aggregated)

films_wordtoken_unstopped_tbl <- films_tokens_tbl |>
  select(-ngramtoken_data) |>
  unnest(wordtoken_data) |>
  select(-full_text)

films_wordtoken_tbl <- films_wordtoken_unstopped_tbl |>
  anti_join(stop_words, by = "word")

2.1 Show Initial Wordclouds

We now want to create some word clouds as a quick initial visualisation of the data.

Show code
plot_unstopped_tbl <- films_wordtoken_unstopped_tbl |>
  count(word) |>
  slice_max(order_by = n, n = 500)

ggwordcloud2(plot_unstopped_tbl, size = 4, seed = 421)

2.2 Word-stemming

We also want to look at stemming our words.

Show code
films_stems_tbl <- films_wordtoken_tbl |>
  mutate(
    snowball_stem = wordStem(word),
    hunspell_stem = hunspell_stem(word)
    )

films_stems_tbl |> glimpse()
Rows: 341,569
Columns: 10
$ film_title    <chr> "A Few Good Men", "A Few Good Men", "A Few Good Men", "A…
$ release_year  <int> 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 19…
$ genre         <chr> "Drama", "Drama", "Drama", "Drama", "Drama", "Drama", "D…
$ title_cleaned <chr> "a_few_good_men", "a_few_good_men", "a_few_good_men", "a…
$ section_title <chr> "Direction", "Direction", "Direction", "Direction", "Dir…
$ grouping_id   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ flag_dialogue <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
$ word          <chr> "fade", "ext", "sentry", "tower", "middle", "night", "mi…
$ snowball_stem <chr> "fade", "ext", "sentri", "tower", "middl", "night", "mid…
$ hunspell_stem <list> "fade", "ext", "sentry", <"tower", "tow">, "middle", "n…

We also create a word cloud of this stemmed data.

Show code
plot_stemmed_tbl <- films_stems_tbl |>
  count(word = snowball_stem) |>
  slice_max(order_by = n, n = 500)

ggwordcloud2(plot_stemmed_tbl, size = 2, seed = 422)

2.3 Contrasting Dialogue and Direction

Finally, we look at just the words in the lines of dialogue.

Show code
plot_dialogue_tbl <- films_wordtoken_tbl |>
  filter(flag_dialogue == TRUE) |>
  count(word) |>
  slice_max(order_by = n, n = 500)

ggwordcloud2(plot_dialogue_tbl, size = 2, seed = 422)

3 Sentiment Analysis

Sentiment analysis takes the simple approach of assigning some kind of measure of sentiment or emotion to each word, allowing us to quantify these concepts in the text in various ways.

Note that this approach is simplistic: it does not consider context or anything beyond the presence of each word, but it is a quick and simple thing to look at.
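As a minimal sketch of the idea, in base R and using a toy lexicon (the words and values below are illustrative, not taken from the real afinn data):

```r
# Toy sentiment lexicon in the spirit of the afinn data;
# the words and their values here are purely illustrative.
toy_lexicon <- c(good = 3, great = 3, fine = 1, bad = -3, terrible = -3)

# Score a text by summing the lexicon values of the words it contains;
# words outside the lexicon contribute nothing to the score.
score_text <- function(text, lexicon) {
  words <- unlist(strsplit(tolower(text), "[^a-z']+"))
  sum(lexicon[words[words %in% names(lexicon)]])
}

score_text("A great film with a terrible ending", toy_lexicon)  # returns 0
```

The limitation mentioned above is visible immediately: "not good" scores exactly the same as "good", because each word is scored in isolation.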

There are a number of different sets of sentiment data, so we look at each of them in turn.

3.1 Visualising the NRC Sentiments

We use the NRC sentiments and count the appearance of each emotion in this dataset.

Show code
sentiments_nrc_tbl |>
  datatable(
    rownames = FALSE,
    caption  = "nrc Sentiments"
    )
Show code
plot_sentiments_tbl <- films_wordtoken_tbl |>
  inner_join(sentiments_nrc_tbl, by = "word") |>
  count(title_cleaned, sentiment)

ggplot(plot_sentiments_tbl) +
  geom_tile(
    aes(x = title_cleaned %>% str_trunc(20), y = sentiment, fill = n)
    ) +
  scale_fill_gradient(low = "blue", high = "red") +
  labs(
    x = "Film Title",
    y = "Sentiment",
    fill = "Raw Count",
    title = "Sentiments in Film Scripts"
    ) +
  theme(axis.text.x = element_text(angle = 20, vjust = 0.5))

Raw counts are interesting, but it is also worth scaling these counts by the total word count of the script, plotting each count as a ratio of that total.

Show code
films_wordcount_tbl <- films_wordtoken_tbl |>
  count(title_cleaned, name = "total_count")

plot_sentiments_ratio_tbl <- films_wordtoken_tbl |>
  inner_join(sentiments_nrc_tbl, by = "word") |>
  count(title_cleaned, sentiment, name = "word_count") |>
  inner_join(films_wordcount_tbl, by = "title_cleaned") |>
  mutate(word_ratio = word_count / total_count)
  
ggplot(plot_sentiments_ratio_tbl) +
  geom_tile(
    aes(x = title_cleaned %>% str_trunc(20), y = sentiment, fill = word_ratio)
    ) +
  scale_fill_gradient(low = "blue", high = "red") +
  labs(
    x = "Film Title",
    y = "Sentiment",
    fill = "Ratio",
    title = "Sentiments in Film Scripts"
    ) +
  theme(axis.text.x = element_text(angle = 20, vjust = 0.5, size = 8))

3.2 Visualising afinn Sentiments

We now repeat the above exercise, but using the sentiment words in the afinn data - in this dataset each word is assigned a positive or negative number indicating the degree of ‘positivity’ associated with the word.

Show code
sentiments_afinn_tbl |>
  datatable(
    rownames = FALSE,
    caption  = "afinn Sentiments"
    )
Show code
plot_sentiments_tbl <- films_wordtoken_tbl |>
  inner_join(sentiments_afinn_tbl, by = "word") |>
  count(genre, title_cleaned, wt = value, name = "total_sentiment") |>
  inner_join(films_wordcount_tbl, by = "title_cleaned") |>
  mutate(sentiment_ratio = total_sentiment / total_count)


ggplot(plot_sentiments_tbl) +
  geom_col(
    aes(x = title_cleaned %>% str_trunc(20), y = total_sentiment, fill = genre)
    ) +
  scale_fill_brewer(type = "qual", palette = "Set1") +
  labs(
    x = "Film Title",
    y = "Total Sentiment",
    fill = "Genre",
    title = "Total afinn Sentiments by Film"
    ) +
  theme(axis.text.x = element_text(angle = 20, vjust = 0.5, size = 8))

Show code
ggplot(plot_sentiments_tbl) +
  geom_col(
    aes(x = title_cleaned %>% str_trunc(20), y = sentiment_ratio, fill = genre)
    ) +
  scale_fill_brewer(type = "qual", palette = "Set1") +
  labs(
    x = "Film Title",
    y = "Sentiment Ratio",
    fill = "Genre",
    title = "Ratio of afinn Sentiments by Total Word Count by Film"
    ) +
  theme(axis.text.x = element_text(angle = 20, vjust = 0.5, size = 8))

We now want to look at the distribution of sentiment measures in each film.

Show code
plot_sentiments_tbl <- films_wordtoken_tbl |>
  inner_join(sentiments_afinn_tbl, by = "word") |>
  count(genre, title_cleaned, value, name = "value_count") |>
  inner_join(films_wordcount_tbl, by = "title_cleaned") |>
  mutate(
    sentiment_ratio = value_count / total_count
    )

ggplot(plot_sentiments_tbl) +
  geom_col(aes(x = value, y = sentiment_ratio)) +
  facet_wrap(vars(title_cleaned), scales = "free_y") +
  labs(
    x = "Sentiment Value",
    y = "Sentiment Ratio",
    title = "Facet Plot of Distribution of Sentiment Values"
    ) +
  theme(strip.text.x = element_text(size = 8))

3.3 Visualising bing Sentiments

The bing dataset is a list of words with positive and negative sentiment.

Show code
sentiments_bing_tbl |>
  datatable(
    rownames = FALSE,
    caption  = "bing Sentiments"
    )
Show code
plot_sentiments_tbl <- films_wordtoken_tbl |>
  inner_join(sentiments_bing_tbl, by = "word") |>
  count(genre, title_cleaned, sentiment, name = "sentiment_count") |>
  inner_join(films_wordcount_tbl, by = "title_cleaned") |>
  mutate(
    sentiment_ratio = sentiment_count / total_count
    )


ggplot(plot_sentiments_tbl) +
  geom_col(
    aes(x = title_cleaned %>% str_trunc(20), y = sentiment_ratio, fill = sentiment),
    position = "dodge"
    ) +
  scale_fill_brewer(type = "qual", palette = "Set1") +
  labs(
    x = "Film Title",
    y = "Sentiment Ratio",
    fill = "Sentiment",
    title = "Total bing Sentiment by Film"
    ) +
  theme(axis.text.x = element_text(angle = 20, vjust = 0.5, size = 8))

3.4 Visualising loughran Sentiments

The loughran dataset is a list of words with positive and negative sentiment, similar to the bing dataset.

Show code
sentiments_loughran_tbl |>
  datatable(
    rownames = FALSE,
    caption  = "loughran Sentiments"
    )
Show code
plot_sentiments_tbl <- films_wordtoken_tbl |>
  inner_join(sentiments_loughran_tbl, by = "word") |>
  count(genre, title_cleaned, sentiment, name = "sentiment_count") |>
  inner_join(films_wordcount_tbl, by = "title_cleaned") |>
  mutate(
    sentiment_ratio = sentiment_count / total_count
    )


ggplot(plot_sentiments_tbl) +
  geom_col(
    aes(x = title_cleaned %>% str_trunc(20), y = sentiment_ratio, fill = sentiment),
    position = "dodge"
    ) +
  scale_fill_brewer(type = "qual", palette = "Set1") +
  labs(
    x = "Film Title",
    y = "Sentiment Ratio",
    fill = "Sentiment",
    title = "Total loughran Sentiment by Film"
    ) +
  theme(axis.text.x = element_text(angle = 30, vjust = 0.5, size = 8))

Show code
ggplot(plot_sentiments_tbl) +
  geom_col(aes(x = title_cleaned %>% str_trunc(20), y = sentiment_ratio)) +
  facet_wrap(vars(sentiment), scales = "free_y") +
  labs(
    x = "Film Title",
    y = "Sentiment Ratio",
    title = "Facet Plot of Distribution of Sentiment Values"
    ) +
  theme(
    axis.text.x = element_text(angle = 30, vjust = 0.5, size = 4),
    strip.text.x = element_text(size = 8)
    )

4 Word and Document Frequency

We now want to look at the use of words within each film and overall as well.

In particular, we account for differences between stage directions and dialogue, and we also want to analyse the word tokens without excluding any stop words.

Show code
total_wordfreq_tbl <- films_wordtoken_unstopped_tbl |>
  count(word, sort = TRUE) |>
  mutate(
    freq = n / sum(n),
    rank = row_number()
    )

ggplot(total_wordfreq_tbl) +
  geom_line(aes(x = rank, y = freq)) +
  scale_x_log10() +
  scale_y_log10() +
  labs(
    x = "Word Rank",
    y = "Word Frequency",
    title = "Log-Log Plot of Word Frequency vs Ranking"
    )

Overall, we see that word frequency follows a power law, and it is worth exploring how this differs when we segment by individual film.
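Under Zipf's law, the frequency of the r-th most common word is roughly proportional to r^(-s) with s close to 1, so the log-log plot above should be close to a straight line with slope -s. A quick sketch of estimating the exponent by regression, here on synthetic counts rather than the total_wordfreq_tbl computed above:

```r
# Synthetic rank-frequency data that is exactly Zipfian with exponent 1;
# in practice the counts would come from a table like total_wordfreq_tbl.
ranks  <- 1:1000
counts <- 1e6 / ranks

# The Zipf exponent is minus the slope of the log-log regression.
fit <- lm(log(counts) ~ log(ranks))
-coef(fit)[["log(ranks)"]]  # returns 1 for this synthetic data
```

Running the same regression on the per-film tables would give one exponent per film, a compact way of comparing their vocabulary profiles.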

Show code
total_film_wordfreq_tbl <- films_wordtoken_unstopped_tbl |>
  count(title_cleaned, word, sort = TRUE) |>
  group_by(title_cleaned) |>
  mutate(
    freq = n / sum(n),
    rank = row_number()
    )

ggplot(total_film_wordfreq_tbl) +
  geom_line(aes(x = rank, y = freq, colour = title_cleaned)) +
  scale_x_log10() +
  scale_y_log10() +
  labs(
    x = "Word Rank",
    y = "Word Frequency",
    title = "Log-Log Plot of Film Word Frequency vs Ranking"
    ) +
  theme(legend.position = "none")

4.1 Calculate Term Frequency - Inverse Document Frequency

We now want to look at the statistic known as the term frequency - inverse document frequency, the TF-IDF. This statistic calculates the relative frequency of each token within a document, then scales it by the logarithm of the inverse of the proportion of documents in which the token appears.

The effect of this is to show terms that appear frequently in only a subset of the documents - if a token appears in most or all of the documents, it is heavily downweighted by its high document frequency.
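The definition can be made concrete by computing it by hand. A base-R sketch with two toy "documents" (bind_tf_idf() from tidytext uses the same formulas, with the natural log for the idf term):

```r
# Two toy documents, each represented as a bag of word tokens.
docs <- list(
  film_a = c("luke", "force", "force", "ship"),
  film_b = c("ship", "ocean", "ocean", "ocean")
)

tf_idf <- function(term, doc) {
  # term frequency: proportion of the document made up of this term
  tf  <- sum(docs[[doc]] == term) / length(docs[[doc]])
  # document frequency: how many documents contain the term at all
  df  <- sum(vapply(docs, function(d) term %in% d, logical(1)))
  # inverse document frequency, on the natural-log scale
  idf <- log(length(docs) / df)
  tf * idf
}

tf_idf("force", "film_a")  # 0.5 * log(2), roughly 0.35
tf_idf("ship",  "film_a")  # 0: "ship" appears in every document
```

The second call shows the downweighting described above: a term present in every document has idf = log(1) = 0, so its TF-IDF vanishes no matter how often it is used.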

In terms of defining the idea of a ‘document’ in this, we start by considering each separate film to be a document.

Show code
total_film_tfidf_tbl <- films_wordtoken_unstopped_tbl |>
  count(title_cleaned, word, name = "word_count") |>
  bind_tf_idf(word, title_cleaned, word_count)

total_film_tfidf_tbl |>
  slice_max(order_by = tf_idf, n = 500) |>
  mutate(
    tf     = tf     |> round(4),
    idf    = idf    |> round(4),
    tf_idf = tf_idf |> round(4)
    ) |>
  datatable(
    rownames = FALSE,
    caption  = "TF-IDF Statistics for Words"
    )

We see that, for each film, the highest TF-IDF tokens appear to be character names, so it is worth redoing this using only the dialogue text.

This requires a bit of processing of the data, as we want to collapse all the dialogue for a given character into a single document.

Show code
films_dialogue_text_tbl <- films_parsed_tbl |>
  select(-parsing_detailed) |>
  unnest(parsing_aggregated) |>
  select(-full_text) |>
  transmute(
    film_title, release_year, genre, title_cleaned,
    section_cleaned = section_title |>
      str_replace_all("\\(.*?\\)", "") |>
      str_squish(),
    flag_dialogue,
    grouping_id,
    trimmed_text
    ) |>
  filter(flag_dialogue == TRUE) |>
  group_by(
    film_title, release_year, genre, title_cleaned, character = section_cleaned
    ) |>
  arrange(grouping_id) |>
  summarise(
    .groups = "drop",
    
    dialogue_text = str_c(trimmed_text, collapse = " ") |> str_squish()
    )

films_dialogue_counts_tbl <- films_dialogue_text_tbl |>
  unnest_tokens(word, dialogue_text) |>
  count(
    film_title, release_year, genre, title_cleaned, character,
    name = "dialogue_count"
    )

films_dialogue_text_tbl <- films_dialogue_text_tbl |>
  inner_join(
    films_dialogue_counts_tbl,
    by = c("film_title", "release_year", "genre", "title_cleaned", "character")
    )

films_dialogue_text_tbl |> glimpse()
Rows: 1,746
Columns: 7
$ film_title     <chr> "A Few Good Men", "A Few Good Men", "A Few Good Men", "…
$ release_year   <int> 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1…
$ genre          <chr> "Drama", "Drama", "Drama", "Drama", "Drama", "Drama", "…
$ title_cleaned  <chr> "a_few_good_men", "a_few_good_men", "a_few_good_men", "…
$ character      <chr> "AGENT #1", "DAWSON", "DOWNEY", "GIBBS", "GINNY", "HAMM…
$ dialogue_text  <chr> "Workin' late, lieutenant?", "Sir, Lance Corporal Harol…
$ dialogue_count <int> 3, 456, 301, 53, 5, 90, 308, 1978, 2391, 7, 3, 6935, 36…
Show code
films_dialogue_text_tbl |>
  slice_head(n = 100) |>
  datatable(
    rownames = FALSE,
    caption  = "Film Dialogue Compressed - Character as Document"
  )

We first want to create a token table from this new format of data.

Show code
films_dialogue_wordtoken_tbl <- films_dialogue_text_tbl |>
  unnest_tokens(word, dialogue_text)

films_dialogue_wordtoken_tfidf_tbl <- films_dialogue_wordtoken_tbl |>
  count(title_cleaned, word, name = "word_count") |>
  bind_tf_idf(word, title_cleaned, word_count)

films_dialogue_wordtoken_tfidf_tbl |>
  slice_max(order_by = tf_idf, n = 500) |>
  mutate(
    tf     = tf     |> round(4),
    idf    = idf    |> round(4),
    tf_idf = tf_idf |> round(4)
    ) |>
  datatable(
    rownames = FALSE,
    caption  = "Film Dialogue - TF-IDF Words"
    )

Despite filtering out the non-dialogue text, we see very similar results for the TF-IDF values. As many of the top tokens are still names, this makes sense: character names are likely spoken as part of the dialogue.

4.2 Work on Bi-Gram Data

We now repeat this process, but now look at the bi-grams.

Show code
films_ngrams_tbl <- films_tokens_tbl |>
  select(-wordtoken_data) |>
  unnest(ngramtoken_data)

films_ngrams_tbl |>
  drop_na(word) |>
  select(-full_text) |>
  slice_head(n = 500) |>
  datatable(
    rownames = FALSE,
    caption  = "ngram Tokens"
    )

We now remove any n-gram where either of its words is on the stop-word list.

Show code
films_ngrams_stopped_tbl <- films_ngrams_tbl |>
  drop_na(word) |>
  mutate(
    token_word = word
    ) |>
  separate(token_word, c("word1", "word2"), sep = " ") |>
  anti_join(stop_words, by = c("word1" = "word")) |>
  anti_join(stop_words, by = c("word2" = "word")) |>
  select(-word1, -word2)

films_ngrams_tfidf_tbl <- films_ngrams_stopped_tbl |>
  count(title_cleaned, word, name = "word_count") |>
  bind_tf_idf(word, title_cleaned, word_count)

films_ngrams_tfidf_tbl |>
  slice_max(order_by = tf_idf, n = 500) |>
  mutate(
    tf     = tf     |> round(4),
    idf    = idf    |> round(4),
    tf_idf = tf_idf |> round(4)
    ) |>
  datatable(
    rownames = FALSE,
    caption  = "Film Dialogue - TF-IDF Words"
    )

4.3 Construct Graph Based on Bi-Grams

An alternative approach to looking at this data is to construct a directed graph of words with the edges being determined by the first and second word of the bi-gram.

Show code
bigram_graph_edges_tbl <- films_ngrams_stopped_tbl |>
  filter(flag_dialogue == TRUE) |>
  separate(word, c("word1", "word2"), sep = " ") |>
  drop_na(word2) |>
  count(word1, word2, name = "bigram_count", sort = TRUE) |>
  filter(word1 != word2)

bigram_graph <- bigram_graph_edges_tbl |>
  graph_from_data_frame()

bigram_tblgraph <- bigram_graph |>
  as_tbl_graph() |>
  activate(nodes) |>
  mutate(
    comp_id = group_components()
    ) |>
  group_by(comp_id) |>
  mutate(
    comp_size = n()
    ) |>
  ungroup()

bigram_tblgraph |> print()
# A tbl_graph: 10947 nodes and 20420 edges
#
# A directed simple graph with 472 components
#
# Node Data: 10,947 × 3 (active)
  name  comp_id comp_size
  <chr>   <int>     <int>
1 red         1      9904
2 jesus       1      9904
3 code        1      9904
4 rum         1      9904
5 dr          1      9904
6 obi         1      9904
# … with 10,941 more rows
#
# Edge Data: 20,420 × 3
   from    to bigram_count
  <int> <int>        <int>
1     1     4           52
2     2   368           49
3     3     1           47
# … with 20,417 more rows

To help visualise this graph, we will look at a few of the smaller disjoint components.

Show code
bigram_tblgraph |>
  filter(comp_id %in% c(2, 3, 4, 5, 6, 7)) |>
  plot()

We now run some community-detection routines on the largest component of the graph.

Show code
bigram_tblgraph <- bigram_tblgraph |>
  convert(to_subgraph, comp_id == 1) |>
  mutate(
    clust_id = group_walktrap(weights = bigram_count)
    ) |>
  group_by(clust_id) |>
  mutate(
    clust_size = n()
    ) |>
  ungroup()

bigram_tblgraph |> print()
# A tbl_graph: 9083 nodes and 18560 edges
#
# A directed simple graph with 1 component
#
# Node Data: 9,083 × 6 (active)
  name  comp_id comp_size .tidygraph_node_index clust_id clust_size
  <chr>   <int>     <int>                 <int>    <int>      <int>
1 red         1      9083                     1       22         15
2 brick       1      9083                     2       16         20
3 jesus       1      9083                     3        1       2662
4 rum         1      9083                     4       22         15
5 dr          1      9083                     5       18         19
6 john        1      9083                     6        1       2662
# … with 9,077 more rows
#
# Edge Data: 18,560 × 4
   from    to bigram_count .tidygraph_edge_index
  <int> <int>        <int>                 <int>
1     1     4           52                     1
2     2  1152           35                     2
3     3   628           34                     3
# … with 18,557 more rows

We now want to look at a sample of these communities to get an idea of which words are being connected.

Show code
bigram_tblgraph |>
  filter(clust_id %in% c(16, 18, 22)) |>
  plot()

5 Analysing Dialogue Text

Returning to the dialogue text data, we explore it a little further, in particular looking at the characters with the most dialogue.

Show code
plot_tbl <- films_dialogue_text_tbl |>
  transmute(
    character_label = glue("{character} ({film_title})"),
    dialogue_count
    ) |>
  slice_max(order_by = dialogue_count, n = 30)

ggplot(plot_tbl) +
  geom_col(aes(x = character_label, y = dialogue_count)) +
  scale_y_continuous(labels = label_comma()) +
  coord_flip() +
  labs(
    x = "Character",
    y = "Word Count",
    title = "Word Count by Character (Film Title in Parentheses)"
    ) +
  theme(axis.text.y = element_text(size = 10))

We see that most of the films in this plot appear once, with a small number of films appearing twice.

It is worth checking the distribution of word counts in these films as well.

Show code
plot_tbl <- films_dialogue_text_tbl |>
  group_by(film_title, title_cleaned) |>
  slice_max(order_by = dialogue_count, n = 3) |>
  mutate(char_rank = min_rank(-dialogue_count) |> as.character()) |>
  ungroup() |>
  mutate(title_trunc = str_trunc(film_title, width = 20))

ggplot(plot_tbl) +
  geom_col(
    aes(x = title_trunc, y = dialogue_count, fill = char_rank),
    position = "dodge"
    ) +
  scale_y_continuous(labels = label_comma()) +
  scale_fill_brewer(type = "qual", palette = "Set1") +
  coord_flip() +
  labs(
    x = "Film Title",
    y = "Word Count",
    fill = "Rank",
    title = "Distribution of Word Counts by Film Title"
    )

6 Write to Disk

We now want to write a number of datasets to disk for future analysis.

Show code
films_parsed_tbl |> write_rds("data/films_parsed_tbl.rds")

films_tokens_tbl |> write_rds("data/films_tokens_tbl.rds")

films_dialogue_text_tbl |> write_rds("data/film_dialogue_text_tbl.rds")

7 R Environment

─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.2.1 (2022-06-23)
 os       Ubuntu 20.04.5 LTS
 system   x86_64, linux-gnu
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       Etc/UTC
 date     2023-01-31
 pandoc   2.19.2 @ /usr/lib/rstudio-server/bin/quarto/bin/tools/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
 package       * version date (UTC) lib source
 assertthat      0.2.1   2019-03-21 [1] RSPM (R 4.2.0)
 backports       1.4.1   2021-12-13 [1] RSPM (R 4.2.0)
 broom           1.0.1   2022-08-29 [1] RSPM (R 4.2.0)
 bslib           0.4.0   2022-07-16 [1] RSPM (R 4.2.0)
 cachem          1.0.6   2021-08-19 [1] RSPM (R 4.2.0)
 cellranger      1.1.0   2016-07-27 [1] RSPM (R 4.2.0)
 cli             3.4.1   2022-09-23 [1] RSPM (R 4.2.0)
 codetools       0.2-18  2020-11-04 [2] CRAN (R 4.2.1)
 colorspace      2.0-3   2022-02-21 [1] RSPM (R 4.2.0)
 conflicted    * 1.1.0   2021-11-26 [1] RSPM (R 4.2.0)
 cowplot       * 1.1.1   2020-12-30 [1] RSPM (R 4.2.0)
 crayon          1.5.2   2022-09-29 [1] RSPM (R 4.2.0)
 crosstalk       1.2.0   2021-11-04 [1] RSPM (R 4.2.0)
 DBI             1.1.3   2022-06-18 [1] RSPM (R 4.2.0)
 dbplyr          2.2.1   2022-06-27 [1] RSPM (R 4.2.0)
 digest          0.6.30  2022-10-18 [1] RSPM (R 4.2.0)
 dplyr         * 1.0.10  2022-09-01 [1] RSPM (R 4.2.0)
 DT            * 0.26    2022-10-19 [1] RSPM (R 4.2.0)
 ellipsis        0.3.2   2021-04-29 [1] RSPM (R 4.2.0)
 evaluate        0.17    2022-10-07 [1] RSPM (R 4.2.0)
 fansi           1.0.3   2022-03-24 [1] RSPM (R 4.2.0)
 farver          2.1.1   2022-07-06 [1] RSPM (R 4.2.0)
 fastmap         1.1.0   2021-01-25 [1] RSPM (R 4.2.0)
 fastmatch       1.1-3   2021-07-23 [1] RSPM (R 4.2.0)
 forcats       * 0.5.2   2022-08-19 [1] RSPM (R 4.2.0)
 fs            * 1.5.2   2021-12-08 [1] RSPM (R 4.2.0)
 furrr         * 0.3.1   2022-08-15 [1] RSPM (R 4.2.0)
 future        * 1.28.0  2022-09-02 [1] RSPM (R 4.2.0)
 gargle          1.2.1   2022-09-08 [1] RSPM (R 4.2.0)
 generics        0.1.3   2022-07-05 [1] RSPM (R 4.2.0)
 ggforce         0.4.1   2022-10-04 [1] RSPM (R 4.2.0)
 ggnetwork     * 0.5.10  2021-07-06 [1] RSPM (R 4.2.0)
 ggplot2       * 3.3.6   2022-05-03 [1] RSPM (R 4.2.0)
 ggraph        * 2.1.0   2022-10-09 [1] RSPM (R 4.2.0)
 ggrepel         0.9.1   2021-01-15 [1] RSPM (R 4.2.0)
 ggwordcloud   * 0.5.0   2019-06-02 [1] RSPM (R 4.2.0)
 globals         0.16.1  2022-08-28 [1] RSPM (R 4.2.0)
 glue          * 1.6.2   2022-02-24 [1] RSPM (R 4.2.0)
 googledrive     2.0.0   2021-07-08 [1] RSPM (R 4.2.0)
 googlesheets4   1.0.1   2022-08-13 [1] RSPM (R 4.2.0)
 graphlayouts    0.8.3   2022-10-20 [1] RSPM (R 4.2.0)
 gridExtra       2.3     2017-09-09 [1] RSPM (R 4.2.0)
 gtable          0.3.1   2022-09-01 [1] RSPM (R 4.2.0)
 haven           2.5.1   2022-08-22 [1] RSPM (R 4.2.0)
 hms             1.1.2   2022-08-19 [1] RSPM (R 4.2.0)
 htmltools       0.5.3   2022-07-18 [1] RSPM (R 4.2.0)
 htmlwidgets     1.5.4   2021-09-08 [1] RSPM (R 4.2.0)
 httr            1.4.4   2022-08-17 [1] RSPM (R 4.2.0)
 hunspell      * 3.0.2   2022-09-04 [1] RSPM (R 4.2.0)
 igraph        * 1.3.5   2022-09-22 [1] RSPM (R 4.2.0)
 janeaustenr     1.0.0   2022-08-26 [1] RSPM (R 4.2.0)
 jquerylib       0.1.4   2021-04-26 [1] RSPM (R 4.2.0)
 jsonlite        1.8.3   2022-10-21 [1] RSPM (R 4.2.0)
 knitr           1.40    2022-08-24 [1] RSPM (R 4.2.0)
 labeling        0.4.2   2020-10-20 [1] RSPM (R 4.2.0)
 lattice         0.20-45 2021-09-22 [2] CRAN (R 4.2.1)
 lifecycle       1.0.3   2022-10-07 [1] RSPM (R 4.2.0)
 listenv         0.8.0   2019-12-05 [1] RSPM (R 4.2.0)
 lubridate       1.8.0   2021-10-07 [1] RSPM (R 4.2.0)
 magrittr      * 2.0.3   2022-03-30 [1] RSPM (R 4.2.0)
 MASS            7.3-57  2022-04-22 [2] CRAN (R 4.2.1)
 Matrix          1.4-1   2022-03-23 [2] CRAN (R 4.2.1)
 memoise         2.0.1   2021-11-26 [1] RSPM (R 4.2.0)
 modelr          0.1.9   2022-08-19 [1] RSPM (R 4.2.0)
 munsell         0.5.0   2018-06-12 [1] RSPM (R 4.2.0)
 parallelly      1.32.1  2022-07-21 [1] RSPM (R 4.2.0)
 pillar          1.8.1   2022-08-19 [1] RSPM (R 4.2.0)
 pkgconfig       2.0.3   2019-09-22 [1] RSPM (R 4.2.0)
 png             0.1-7   2013-12-03 [1] RSPM (R 4.2.0)
 polyclip        1.10-4  2022-10-20 [1] RSPM (R 4.2.0)
 purrr         * 0.3.5   2022-10-06 [1] RSPM (R 4.2.0)
 quanteda      * 3.2.3   2022-08-29 [1] RSPM (R 4.2.0)
 R6              2.5.1   2021-08-19 [1] RSPM (R 4.2.0)
 RColorBrewer    1.1-3   2022-04-03 [1] RSPM (R 4.2.0)
 Rcpp            1.0.9   2022-07-08 [1] RSPM (R 4.2.0)
 RcppParallel    5.1.5   2022-01-05 [1] RSPM (R 4.2.0)
 readr         * 2.1.3   2022-10-01 [1] RSPM (R 4.2.0)
 readxl          1.4.1   2022-08-17 [1] RSPM (R 4.2.0)
 reprex          2.0.2   2022-08-17 [1] RSPM (R 4.2.0)
 rlang         * 1.0.6   2022-09-24 [1] RSPM (R 4.2.0)
 rmarkdown       2.17    2022-10-07 [1] RSPM (R 4.2.0)
 rstudioapi      0.14    2022-08-22 [1] RSPM (R 4.2.0)
 rvest           1.0.3   2022-08-19 [1] RSPM (R 4.2.0)
 sass            0.4.2   2022-07-16 [1] RSPM (R 4.2.0)
 scales        * 1.2.1   2022-08-20 [1] RSPM (R 4.2.0)
 sessioninfo     1.2.2   2021-12-06 [1] RSPM (R 4.2.0)
 SnowballC     * 0.7.0   2020-04-01 [1] RSPM (R 4.2.0)
 stopwords       2.3     2021-10-28 [1] RSPM (R 4.2.0)
 stringi         1.7.8   2022-07-11 [1] RSPM (R 4.2.0)
 stringr       * 1.4.1   2022-08-20 [1] RSPM (R 4.2.0)
 tibble        * 3.1.8   2022-07-22 [1] RSPM (R 4.2.0)
 tidygraph     * 1.2.2   2022-08-22 [1] RSPM (R 4.2.0)
 tidyr         * 1.2.1   2022-09-08 [1] RSPM (R 4.2.0)
 tidyselect      1.2.0   2022-10-10 [1] RSPM (R 4.2.0)
 tidytext      * 0.3.4   2022-08-20 [1] RSPM (R 4.2.0)
 tidyverse     * 1.3.2   2022-07-18 [1] RSPM (R 4.2.0)
 tokenizers      0.2.3   2022-09-23 [1] RSPM (R 4.2.0)
 tweenr          2.0.2   2022-09-06 [1] RSPM (R 4.2.0)
 tzdb            0.3.0   2022-03-28 [1] RSPM (R 4.2.0)
 utf8            1.2.2   2021-07-24 [1] RSPM (R 4.2.0)
 vctrs           0.5.0   2022-10-22 [1] RSPM (R 4.2.0)
 viridis         0.6.2   2021-10-13 [1] RSPM (R 4.2.0)
 viridisLite     0.4.1   2022-08-22 [1] RSPM (R 4.2.0)
 withr           2.5.0   2022-03-03 [1] RSPM (R 4.2.0)
 xfun            0.34    2022-10-18 [1] RSPM (R 4.2.0)
 xml2            1.3.3   2021-11-30 [1] RSPM (R 4.2.0)
 yaml            2.3.6   2022-10-18 [1] RSPM (R 4.2.0)

 [1] /usr/local/lib/R/site-library
 [2] /usr/local/lib/R/library

──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────